Autonomous localized path planning algorithm for UAVs based on TD3 strategy

Unmanned Aerial Vehicles are useful tools for many applications. However, autonomous path planning for Unmanned Aerial Vehicles in unfamiliar environments is a challenging problem when facing a series of problems such as poor consistency, high influence by the native controller of the Unmanned Aerial Vehicles. In this paper, we investigate reinforcement learning-based autonomous local path planning methods for Unmanned Aerial Vehicles with high autonomous decision-making capability and locally high portability. We propose an autonomous local path planning algorithm based on the TD3 strategy to solve the problem of local obstacle avoidance and path planning in unfamiliar environments using autonomous decision-making of Unmanned Aerial Vehicles. The simulation results on Gazebo show that our method can effectively realize the autonomous local path planning task for Unmanned Aerial Vehicles, the success rate of path planning with our method can reach 93% under the interference of no obstacles, and 92% in the environment with obstacles. Finally, our method can be used for autonomous path planning of Unmanned Aerial Vehicles in unfamiliar environments.

RL Intelligent Body (RL Agent) fuses the image information output from the target detection module with the altitude information provided by the sensors and inputs it to the RL Agent and finally outputs advanced control information to the controller of the cooperative operating system so that the UAV can efficiently and accurately complete the local path planning task.The algorithms in this paper are useful when adapting to multiple types of UAVs (with different underlying controllers) or when the UAV is equipped with only a monocular camera and a rudimentary LIDAR or when the UAV's onboard computing device is not GPU-accelerated, or when all three of the above characteristics are present for the localized path planning work.The contributions of this paper are as follows: • Proposing an RL strategy algorithm for the system for autonomous local path planning.With the help of this algorithm, it is possible to realize a capability that has a high degree of portability and allows the UAV to make autonomous decisions and realize path planning without changing the native controller of the UAV; • Includes a robot advance controller system for the Unmanned Aerial Vehicles' flight control.With this system, the underlying flight controllers of various UAVs can be better adapted to better improve the portability of the algorithms in this paper; • Proposing a Gazebo simulation environment to test the algorithms in this paper, which includes the training environment and testing environment to be suitable for the RL agent training.

Related works
With the development of deep reinforcement learning (DRL), there is an increasing number of various works on reinforcement learning applied to robotics 5,6,9 .Reinforcement learning has been successful in the field of FPV racing sports as well 10,11 .And for or trajectory optimization, a series of effective strategies have also been proposed in recent years to solve the trajectory optimization problems [12][13][14][15][16][17] .For quadcopters, there are also some excellent algorithms that have been proposed in the recent past to enable autonomous UAV deer in unfamiliar environments 7,8 , which can also show good results in cluster operation 18 .With the continuous development and advancement of real-time target detection 19,20 , cameras 21 , lidar, and deep reinforcement learning algorithms [22][23][24] , the cost of deploying reinforcement learning to robots is gradually decreasing and its feasibility is gradually increasing, and a lot of work on integrating DRL with robotics has emerged.In Ref. 9 , the target point is defined as a semantic network result and the semantic image is fused with the depth image.Finally, the results are fed into the RL strategy for decision-making, which ultimately results in a fast and accurate emergency landing of the UAV.In 6 , the authors used random coordinates as the form of the target point for the UAV.After the unmanned vehicle acquires the target point, according to the distance information returned by the LiDAR, the information is fused with the car's own sensor data, and then input into the RL Agent to output the decision, realizing the autonomous navigation function of the unmanned vehicle.Although these works have yielded good results, most of them require depth information to solve for the relative positions of the target points or assume that the relative positions of the target points have been obtained by other methods.
In this paper, we present an autonomous local path planning method that addresses the challenge of UAV path planning with only image data.Our approach is based on the deep reinforcement learning (DRL) algorithm and solves the issue of inaccurate relative positioning when the UAVs are at different altitudes and there is no target depth information available.The proposed method enables UAVs to efficiently and accurately reach their targets and complete localized path-planning tasks.
The remainder of this paper is organized as follows.The research direction of this thesis was introduced in Introduction.The work related to reinforcement learning and path planning was described in "Related works".The issues encountered and the system model is described in "Problem description and system modelling".The path planning algorithm of this paper is presented in "Path planning algorithm".The relevant data and results of the experiments are given in "Result analysis".Finally, The conclusions and future works are drawn in "Conclusion and future work".

Problem description and system modelling
This paper is dedicated to solving the problem of designing decision-making models to help UAVs realize local path planning work in unfamiliar environments based on environmental information without changing the native controller of the UAV.This paper formulates it as a deep reinforcement learning problem, in which the RL Agent extracts preprocessed environmental information and sensor information for training, so that the RL Agent can utilize the environmental information to enable the UAV to accurately find the relative position information and make the path planning in the absence of GPS signals as well as target depth information.
In view of the introduction of yolov7 25 , this paper adapts its interface to the RL Agent, which greatly reduces the training resources, and can be deployed on the cpu platform to complete the training phase.To train the model, this paper uses a realistic physical model provided by px4 26 .The advanced control commands are outputted to the HWT_OS system, which can be quickly deployed to the real UAV environment after fine-tuning the parameters on the basis of the simulation training and when the camera internal parameters are kept consistent.

RL agent system model
As shown in Fig. 1, the environment information is transmitted to the data preprocessing module from the simulation environment or the real machine.In order to improve the training speed, the work in this paper does not hand over the image information, sensor information, and lidar information to the RL Agent for processing, but the data are handed over to the data preprocessing module for preprocessing.The data preprocessing and fusion module preprocesses and fuses the image information, sensor information, and lidar information as the environment information to be processed by the RL Agent, which makes corresponding planning according to

HWT_OS system
HWT_OS is a set of UAV-implemented cluster control system established at the beginning of this work.As shown in Fig. 2, the unified information data structure as well as the communication protocols allow the newly added robots to be more quickly integrated into the existing cluster control system.The work in this paper uses only parts of its advanced controller for better control of UAVs and uses its env modules to provide interfaces to adapt RL algorithms to his calls.The UAV advanced controller module of HWT_OS integrates the interfaces of PX4-MAVLINK-ROS-MAV-ROS and provides additional ENU, body, etc. for controlling the speed, position, and acceleration of the UAV in different combinations under multiple coordinate systems, and provides interfaces on the ROS side to enable external operations and their convenient use of commands to control the UAV in multiple coordinate systems to different trajectories movement in multiple coordinate systems.The env modules are specially designed to provide an interface for gazebo simulation, integrating the advanced controller interface of HWT_OS, which can be uniformly delivered to gazebo for manipulating the robot's initialization, movement, and obstacle changes.

Path planning algorithm
The RL Agent uses the Actor-Critic model to map the preprocessed environmental information to the UAV actions, and its architecture is shown in Fig. 3.In it, each action includes a high-level control instruction (control mode (velocity or displacement), control coordinate system (ENU system or body system), value (velocity value or displacement value)), and each set of values can be selected among the specified thresholds.In the following, we describe in more detail the individual modules of the algorithm.

Algorithmic architecture
Actor and Critic networks share the environment information provided by the data preprocessing and fusion module.The Actor network consists of three fully connected (FC) layers, the last part of which outputs a set of 4-dimensional actions, and in order to make the output action intervals in the simultaneous existence of positive and negative value selection, the activation function of the last FC layer selects the tanh function; the Critic network consists of two networks composed of four FC layers respectively, and both networks output one-dimensional Q-value values, and the RL Agent adopts the policy of TD3 24 to update parameters.The Critic network consists of two networks consisting of 4 FC layers, both of which output one-dimensional Q-value values, and the RL Agent adopts the strategy of TD3 24 to update the parameters, one of which is used to send the Q-value to the Actor network to update the parameters of the Actor network, and the other one is used to update the parameters of the Critic itself in batches during the iteration process.At the same time, in order to improve the training speed and accelerate the training speed of the CPU side, this paper does not add the convolutional layer in the RL training network, and all the preprocessing process of the environmental information is handed over to the data fusion layer and reduces the redundancy degree of the data in the data fusion layer in order to improve the training speed.

Data preprocessing and fusion
The data preprocessing and fusion module provides appropriate state information to the RL Agent to better adapt the reinforcement learning algorithm, which serves to accelerate the training while also better preparing for future expansion and porting to other robotics platforms.
This module integrates yolov7 outputs data, lidar data of UAV, and sensor data of UAV.The three are processed and fused to get the STATE information and provided to the RL Agent.
For reinforcement learning algorithms for simulation applications, each time step takes longer to complete compared to other training algorithms.Directly outputting redundant information such as image and lidar information to the RL Agent can significantly increase the time required for each time step and also prolong the fitting time for training.In the fifth part of this paper, we will compare the experimental data of outputting the redundant information and the fused information.
In order to solve this problem, this paper uses the yolov7-tiny model, and single-step detection as the method, cancels the real-time monitoring to improve the training speed, provides an interface, each time the data fusion is performed, calls the interface to get the yolov7-tiny output image as well as the tensor vector.Only the part of the vector containing coordinate information and category information is intercepted, and a set of 5-dimensional tensor-img vectors is obtained after fusion, and the image information is used to visualize the observation and comparison.In this paper, the same data fusion method is used for the sensor information, and only the current altitude information is extracted for its own basic information, yaw angle information and lidar data.Lidar information has the highest redundancy among the sensor information, considering that the RL Agent adopts single-step planning, this paper, on the basis of eliminating the inf and other useless data, additionally eliminates all the values other than the shortest distance direction, and only retains the shortest distance and the shortest distance direction, so as to reduce the redundancy of the information, and to improve the convergence speed.

Rewards
There are Markovian properties for single-step planning of local path planning, so in this paper, each step of path planning is considered as a Markovian decision-making process.The training of the RL Agent is based on the data and rewards collected at each decision.At each decision, the total reward is composed of different terms as follows: where R t is the final reward, R n is the step reward, R s is the single step reward, R c is the collision reward, and W n , W c are the weights of the step reward as well as the collision reward, respectively.
The final reward R t refers to a fixed 100-point reward given to the UAV when it reaches the point directly above the target point.
The aim is to encourage the drone to reach directly above the goal point without collision.
The step reward R n refers to the ratio between the maximum number of steps the drone can move and the number of steps it is currently moving under the current curtain.This reward's encourages the drone to travel faster to the target point, while improving the convergence speed of the algorithm.
The single-step reward, R s , is the reward that changes most frequently at each time step, which calculates whether the planning of this step keeps the drone close to the target point or not.X P , Y P are the coordinate values of the target center in the image coordinate system, X Pc , Y Pc are the coordinate values of the center point of the image, W p is the weights of the pixel error, X E , Y E are the coordinate values of the target center at the target point in the ENU coordinate system, X Ec , Y Ec are the coordinate values of the current position of the UAV in the ENU coordinate system, and We is the weights of the ENU error.Adding the correction under the ENU system in the but-step reward can increase the convergence speed and reduce the training cost.This reward is mostly negative reward plays the role of encouraging UAV exploration.
The collision reward R c refers to the UAV's penalty for the current curtain in the event of a collision.

Network update strategy
Reinforcement learning discusses the problem of how an agent can maximize the rewards it can obtain inside a complex and uncertain environment.By sensing the STATE of the environment it is in and the rewards REWARD obtained from the action chosen at that moment, the parameter network is updated to obtain a better REWARD.The TD3 algorithm adopted in this paper draws on the training idea of Double DQN 22 , utilizes the Actor network and Critic network in the AC algorithm, and the basic framework of the algorithm is shown in Fig. 4, which is combined with the Bellman equation to realize the autonomous local path planning of the UAV under continuous motion.
In Fig. 4, in TD3 algorithm, Actor network Actor generates the strategy, and Critic network Critic evaluates and improves the current strategy by estimating the dominance function An.Both are optimized according to the strategy gradient, and the updating formulas for the parameters θ(a) of Actor network, and the parameters ϕ(c) of Critic network are as follows: The gain function in the above equation values the Actor strategy and calculates the gain expectation R 0 obtained by the agent with state transfer and action selection after adopting this strategy.
The maximized gain function is as follows: Critic calculates the earnings valuation function as follows: (1)

Simulation experiments in accessibility
In order to quantitatively evaluate the settings of the selected parameter configurations, this paper compares different environmental information, single-step reward noise values, and action noise values in an indoor environment.Under the premise that the training time is set to 8 h and the single-step reward noise value is set to 0.1, the training results of excluding the data preprocessing module are compared with those of adding the data preprocessing module; without limiting the training time, the training results of adding the data preprocessing module and setting the single-step reward noise value to 0.1 are compared with those of 0.5; without limiting the training time and adding the data preprocessing module, the single-step reward noise value is set to 0. As shown in Fig. 5a, when the state information preprocessing module is not intervened, the RL Agent is unable to make proper path planning actions, and the maximum Q value converges to a large negative value, while the intervention of the state information preprocessing module reduces the redundancy of the state information, ( 8) so that there exists a chance for the RL Agent to reach the target point, but due to the pre-training period, the reward noise setting for it is too small, which inhibits its exploration process, so that the UAV exploration converges to the local optimum, and cannot make reasonable path planning actions in the edge zone; when the single-step reward noise value is adjusted higher, the convergence time of the maximum Q value is pushed back, but the average Q value converges in a higher interval, and the maximum Q value stabilizes at the maximum reward, so that the UAV can make suitable planning actions in the edge zone of the map to move towards the target point; when the altitude interval makes a change, the altitude when the height interval is changed, a bigger change in the interval means that the zoom ratio of the image becomes bigger, and the planning difficulty increases.As shown in Fig. 5, the average Q value and the maximum Q value also converge at 7h, and the convergence values are similar.

Simulation experiments with obstacles
In order to verify the integrity of the algorithm local path planning, this paper adds 2 groups of obstacles in the indoor environment, and selects unrestricted training time, adding STATE information preprocessing module, single-step rewarding noise value set to 0.5, and initialization height threshold set to [4-6 m] as the baseline, and meanwhile, in order to increase the algorithm retrospective, the maximum number of steps in a single act is elevated to 10, and in the case of enabling lidar data, the environment augmented obstacle and unaugmented obstacle training results for comparison, according to the training results to calculate the average Q value, the maximum Q value.
As shown in Fig. 5b, this paper re-experiment after the subsequent incremental obstacles, Ave.Q Value of the beginning of the convergence time is significantly reduced, its planning action when encountering obstacles to make its points lower, but the final convergence effect is basically the same, in the value of 50 to achieve convergence.Q Value belongs to the decision made by the UAV in the last planning step, at this time, due to the close to the target point and away from the obstacles, so the convergence speed of Max.Q Value in the case of obstacles and no obstacles of the two do not have a big difference and ultimately converge to the maximum score.
As shown in Table 1, under the standard condition, i.e., no obstacles, the map side length is 5 m * 5 m, no limitation of training time, adding the data preprocessing module, setting the single-step reward noise value to 0.5, and the maximum number of exploring steps is 5, and the initial position of the UAV is randomized in each round; after 500 rounds of testing of the Agent, the UAV successfully achieves the success rate of path planning is 93%, which basically achieves the experiment of this paper.The purpose of this paper is to realize the autonomous local path planning task in unfamiliar environments, and in this map, only 1.73 steps are needed on average to achieve the path planning task; after adjusting the altitude interval, the same 500 rounds of Agent testing with random initial position of the UAV are also carried out, and the success rate and the successful average number of steps basically remain the same; this paper adopts the high-dimensional state space after isolating the data preprocessing module.space, the success rate is only 3%, which cannot realize the path planning task; when the single-step noise value is set to 0.1 to make it less exploratory, the success rate is only 13%, which is due to the fact that the UAV is trapped in a local optimum that it cannot get rid of during path planning.

With single obstacles and multiple consecutive obstacles
In this paper, this model is tested in single-obstacle and multiple-obstacle environments with random locations and random densities, respectively.As shown in Table 1, when a randomly located single obstacle is added to the environment, the UAV's planning success rate is consistent with the performance of the no-obstacle environment, both achieving a 93% success rate.And from Fig. 6a, it can be seen that at this time the difference between the optimized paths of this paper and similar algorithms is not much, and both achieve good path planning work.And after deploying the model to the simulation environment with dense multiple obstacles added, as shown in Table 1, the success rate of planning can reach 87%, which is slightly lower compared to the single obstacle environment, but the number of planning steps is improved to 8.31.As can be seen from Fig. 6b, when the UAV plans a path to the middle of the multiple obstacles, there is a phenomenon that waits in place in order to minimize the penalty of the reward function, which is an important reason leading to the This is an important reason for the reduced success rate of UAV path planning.In contrast to similar algorithms, the algorithm in this paper tends to make the decision to move the UAV towards areas far away from obstacles before proceeding to reduce the risk of collision, while other algorithms tend to move the UAV away from the risk of collision without increasing the distance traveled by the UAV as much as possible.
With dynamic obstacles Our model's decision-making success rate was only 61% when deployed in a simulation environment with dynamic obstacles, as shown in Table 1.Our analysis indicates that there are two main factors responsible for the significant reduction in planning success rate in the dynamic obstacle environment.
Firstly, the single-step planning approach of the algorithm, which treats each planning step as relatively independent, makes it impossible to accurately predict the trajectory of dynamic obstacles.Secondly, the real-time planning nature of the algorithm is not sufficient to continuously predict the dynamic obstacles and achieve the desired level of decision-making.These are the two primary reasons why the model fails to produce good results in dynamic obstacle environments.www.nature.com/scientificreports/

Experimental results
Based on the experiments conducted, this paper's algorithm has shown impressive performance in scenarios without obstacles, a static single obstacle, and multiple continuous obstacles.Compared to other UAV path planning algorithms, this algorithm has greater portability as it supports the HWT_OS system.This means that the model can be directly deployed to other UAVs without changing the UAV's native controller.Additionally, this algorithm leverages the RL Agent, which enables the UAV to make autonomous decisions at a higher level and is comparable to traditional path planning algorithms.Furthermore, it tends to move the UAV away from all obstacles before moving towards the target direction.However, the algorithm in this paper has limitations when it comes to dynamic obstacle environments.The reason is that the algorithm mainly uses the Markov decision process for discontinuous decision-making, which limits the continuous decision-making speed of the UAV when facing dynamic obstacles.Consequently, this algorithm's planning effectiveness is poor in such scenarios.

Conclusion and future work
This paper presents an autonomous local path planning algorithm for UAVs, based on the TD3 algorithm.The algorithm can help UAVs quickly and accurately plan their path in unfamiliar environments.Additionally, this algorithm is highly portable, thanks to the HWT_OS system proposed in this paper.It can work in other UAV devices by modifying only some of the parameters, without changing the native controller of the UAV.And we will further improve the decision speed of the reinforcement learning agent in the future to improve decision-making in dynamic obstacle environments and expand it to set UAV planning scenarios (Supplementary Informations S1 and S2).

Figure 1 .
Figure 1.Block diagram of RL agent system.
5 and the initialized height threshold is set at [0.5]. to 0.5, and compare the training results with initialized height thresholds at [4m-6m] and [4m-12m].Based on the training results, the average Q value, the maximum Q value and the convergence time are calculated.

Figure 5 .
Figure 5. Max.Q and Ave.Q for simulation experiments.

Figure 6 .
Figure 6.The trajectory plot of 2-dimensional space for UAVs.

Table 1 .
Planning success rate under each condition.Significant values are in bold.